ICM/Message Queue Failure
~Julia Fezveluburjip 12/17/2003 03:03 PM
Domino Server 6.0.2 CF2 Solaris

We are running a Domino cluster on two machines, with the ICM running on its own separate partitioned server on one of the machines. Since upgrading to Domino 6 and Solaris 9, the ICM regularly keels over and dies (well, it stops responding and the partitioned Domino server refuses connections). To add to the agony, running NSD -kill usually fails to identify the processes attached to this server instance, and I have to resort to using pkill -9 -u to stop it.

I have, along with our resident Solaris experts, spent many hours trying to track down the problem. I have discovered the following (although I don't know whether any of it is related):
1. The fault recovery log for the ICM server is full of entries that read "[13] ERROR: Message queue failure: REMOVE ITEM" at a frequency of about three a minute. These generally start within about 30 minutes (often much less) of the other Domino server on the same machine starting.
2. The other partitioned server on the same machine writes "[11] ERROR: Process being removed not in queue: xxxx" to the fault recovery log when it shuts down.
3. ipcs reveals that the message queue for the ICM server has vanished when the errors are appearing in its fault recovery log.
4. The ICM server regularly complains that another Domino process is sharing log.nsf, which it isn't as far as I can see.
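
For anyone who wants to pin down exactly when the queue in point 3 disappears, here is a rough Python sketch that lists the message queues owned by a given Unix user from `ipcs -q` output. It is only an illustration: the function name, the "notes" owner, and the assumed column layout (T, ID, KEY, MODE, OWNER, GROUP, whitespace-separated) are my assumptions, so check them against what ipcs actually prints on your box.

```python
import subprocess
import sys
import time


def message_queues(ipcs_output, owner):
    """Return the IDs of message queues owned by `owner`.

    Assumes each queue row looks like: q ID KEY MODE OWNER GROUP
    (whitespace-separated). Rows that do not match are ignored.
    """
    ids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0] == "q" and fields[4] == owner:
            ids.append(fields[1])
    return ids


if __name__ == "__main__":
    # Hypothetical usage: python check_queues.py notes
    owner = sys.argv[1]
    out = subprocess.run(
        ["ipcs", "-q"], capture_output=True, text=True, check=True
    ).stdout
    # Print a timestamped line so runs from cron can be lined up
    # against the fault recovery log.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), message_queues(out, owner))
```

Run once a minute from cron under the account that owns the ICM partition, the timestamps can be compared with the first "Message queue failure" entry to see whether the queue vanishes before or after the errors start.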

We have rlim_fd_max set to 65536 and msgtql set to 1024 (and have tried 2048!).

Any ideas would be greatly appreciated.

